One would be forgiven to think that artificial neural networks are the newest and shiniest of modern data science. On the contrary, the main concepts have been around for decades. But it is recent progress in computational resources and the availability of massive datasets that these learning architectures revealed their true powers. AlphaGo, Siri and Alexa, self-driving cars are all running on some form or other of deep artificial neural networks.
The hype means the Internet is aflush with tutorials and online resources to get started. Yet, somehow R hasn’t gotten much street cred in the area. Most of the frameworks are implemented in Python, and so are the tutorials. R is supposed to be the de facto lingua franca of statistical computing, so what’s up with that?
What follows is a custom build of a simple one hidden-layer neural network, where we’ll save just enough parameters at every iteration to be able to gganimate the training process afterwards.
## Warning: package 'gridExtra' was built under R version 3.4.2
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : font family not found in Windows font database
This post is mostly inspired by Andrew Ng’s Deep Learning course (including the dataset), which I strongly recommend for anyone interested in neural networks. The task is to predict the color of the points in the plot on the right. While it seems like a trivial problem, linear algorithms will inevitably fail at it because the colors are not linearly separable. There’s no single line we can draw that perfectly separates the red dots from the blue dots.
The input layer contains the input data. The number of nodes in the input layer is always equal to the number of features in the data (X, Y coordinates). The input is then 1. propagated forward through the hidden layer. Number of hidden nodes and number of hidden layers can be modified at will. The edges between the nodes represent the weights, and the prediction is essentially a function of these weights.
Once the data has been passed through the entire network, we get the predictions in the output node and 2. compute the cost with respect to the actual labels. At each iteration, we adjust the weights to minimize this cost as much possible.
How do we do that? That’s the job of 3. backward propagation. By means of gradient descent, we calculate the partial derivatives of all computations with respect to what came after, alas we go backwards. First the derivatives of the weights of the hidden layer with respect to the output layer, and secondly those of the input layer with respect to the hidden layer. The gradients we obtain are then used to update the weights and start the process all over again. With each pass - also called epochs, we get closer to the optimal weights.
I will now explain in short and code up each of the three computations. The scripts we define will be used inside a single function call that trains the neural network. Significant overhead will be introduced by saving parameters at every iteration, but hopefully the animated plots will be worth it.
Forward propagation is the first pass through the data, calculating an output based on the weights of each edge. The connections from the input layer to each of the hidden nodes is a linear function, followed by an activation function.
The computational steps of forward propagations are \(Input -> Hidden -> Output\) .
If we break down each of the two connections into a linear function \(Z^{[i]}\) and an activation function \(A^{[i]}\), the architecture becomes \(X ->Z^{[1]}->A^{[1]}->Z^{[2]}->A^{[2]}\) with \(X\) as the input data.
The activation function is usually a non-linear function that enables the network to cope with non-linear problems. Examples include the sigmoid function, relu or tanh.
Let’s take for example the connections going from the input layer to one hidden node. If \(X_{m,n}\) is the vertically stacked dataset where m = number of features (2) , n = number of observations, \(w\) is a weight vector of length m; the linear function in one hidden node can be formally represented as a matrix vector product:
\[\begin{align*} w = \begin{pmatrix} w_{1} \\ w_{2} \\ \end{pmatrix}; X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \end{pmatrix} \end{align*}\] \[\begin{align*} Z = {w^T}X + b= \begin{pmatrix} w_{1} & w_{2} \\ \end{pmatrix} \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \end{pmatrix} + b \end{align*}\] \[\begin{align*} Z = \begin{pmatrix} w_{1}x_{1,1} + w_{2}x_{2,1}+b & w_{1}x_{1,2}+ + w_{2}x_{2,2}+b & \cdots & w_{1}x_{1,n} + w_{2}x_{2,n}+b \end{pmatrix} \end{align*}\]The activation function \(A^{[1]}\) is the tanh \(A^{[1]} = \tanh(Z^{[1]})\), for the output layer we’ll use the sigmoid instead \(A^{[2]} = \sigma(Z^{[2]})\). The computation can also be visualised as a subgraph of our neural network:
It turns out that this implementation scales to multiple hidden nodes without any formal change to the math. Instead of a weight vector \(w\), we are computing the same functions using a weight matrix \(W\). The matrix-vector product now becomes a dot product between the two matrices. With each node in the hidden layer, we add an extra row in the transposed weight matrix. The dimensionality requirements of matrix multiplication are kept intact: The number of columns of first matrix still equal the number of rows of the second. But the dimensions of the output change accordingly. We go from a transposed vector of length n to an m x n matrix where m = the number of hidden nodes.
\[\begin{align*} Z = {W^T}X + b= \begin{pmatrix} w_{1,1} & w_{1,2} \\ \vdots & \vdots \\ w_{n,1} & w_{n,2} \end{pmatrix} \begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ x_{2,1} & \cdots & x_{2,n} \\ \end{pmatrix} + b \end{align*}\] \[\begin{align*} Z = \begin{pmatrix} w_{1,1}x_{1,1} + w_{1,2}x_{2,1}+b & \cdots & w_{1,1}x_{1,n} + w_{1,2}x_{2,n}+b \\ \vdots & \ddots & \vdots\\ w_{n,1}x_{1,1} + w_{n,2}x_{2,1}+b & \cdots & w_{n,1}x_{1,n} + w_{n,2}x_{2,n}+b \end{pmatrix} \end{align*}\]The last step of going from the hidden layer to the output layer follows the same algebra. I’ll spare you the nitty gritty. Before we propagate forward for the first time, it is important to randomly initialize the weights. Otherwise each connection will compute the exact same thing.
initialize_parameters <- function(n_x, n_h, n_y) {
set.seed(2)
# W1 -- weight matrix of shape (n_h, n_x)
W1 = matrix(rnorm(n_x*n_h), nrow = n_h, ncol = n_x) * 0.01
# b1 -- bias vector of shape (n_h, 1)
b1 = rep(0, n_h)
# W2 -- weight matrix of shape (n_y, n_h)
W2 = matrix(rnorm(n_h*n_y), nrow = n_y, ncol = n_h) * 0.01
# b2 -- bias vector of shape (n_y, 1)
b2 = rep(0, n_y)
parameters = list(W1 = W1,b1 = b1,W2 = W2,b2 = b2)
return(parameters)
}
Remember the schema is \(X ->Z^{[1]}->A^{[1]}->Z^{[2]}->A^{[2]}\). Both \(Z^{[1]}\) and \(Z^{[2]}\) are the same linear function, while \(A^{[1]} = \tanh(Z^{[1]})\) and \(A^{[2]} = \sigma(Z^{[2]})\). The sigmoid function didn’t make it to base R, so we define it first.
sigmoid <- function(x) {
1 / (1 + exp(-x))
}
forward_propagation <- function(X, parameters) {
# Retrieve each parameter from the list "parameters"
W1 <- parameters$W1; b1 <- parameters$b1
W2 <- parameters$W2; b2 <- parameters$b2
# Forward propagation
Z1 = W1 %*% X + b1
A1 = tanh(Z1)
Z2 = W2 %*% A1 + b2
A2 = sigmoid(Z2)
cache <- list(Z1=Z1,A1=A1,Z2=Z2,A2=A2)
return(cache)
}
Each pass of forward propagation ends with a prediction. Generating a prediction for every pixel of our plot raster, we can simulate decision boundaries. As the algorithm learns, the borders slowly align with the data:
As we have seen above, forward propagation is nothing more than a predict function. When the dataset has been passed through the network, we get an output that can be compared to the actual label. The purpose of the cost function is to inform the model how far the output is from the target value. One of the most popular cost functions is log loss, formally known as:
\[J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large\left(\small Y\log\left(A^{[2]}\right) + (1-Y)\log\left(1- A^{[2]}\right) \large \right) \small \]
compute_cost <- function(A2, Y) {
# Number of observations
m <- dim(Y)[2]
cost <- -1/m * sum(Y * log(A2) + (1-Y)*log(1-A2))
return(cost)
}
You saw how the algorithm was getting better at predicting the colors with each iteration. This is the result of reducing the cost as the model learns.